Learning outcomes

  • Review

  • Multiple factors

  • Interactions

Review

Review

The linear model estimates the mean of the outcome given the covariates.

  • the mean body weight for a given heart girth.

  • the mean body weight given heart girth, umbilical girth, length and sex.

  • the mean right wing length for different areas.

It also gives uncertainties, i.e. confidence intervals.

Review

  • For factor (grouping) variables, it estimates a separate mean for each level.

  • The first level (the baseline) goes into the intercept.

  • Each other level is represented by its difference from the baseline.

  • The overall F-test tells us whether the grouping variable is important.

The F-statistic for factors

\[ F = \frac{\mathsf{Variation\ explained}}{\mathsf{Variation\ unexplained}} = \frac{(\mathsf{SS}_\mathsf{total} - \mathsf{SS}_\mathsf{res})/p}{\mathsf{SS}_\mathsf{res}/(n-p-1)} \]

  • The numerator is variation explained by the model. In this case this is variation between groups, as the model is just giving a different mean to each group.

  • The denominator is residual variation, which in this case is variation within groups (residuals are values minus fit, which is the group mean).

  • Thus F is a ratio of between-group variation to within-group variation.
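The ratio above can be checked by hand in R. A minimal sketch on simulated data (the group names and means are made up for illustration):

```r
# Compute F by hand for a one-factor model and compare with lm()'s value.
set.seed(1)
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  y     = rnorm(60, mean = rep(c(10, 12, 9), each = 20))
)
fit <- lm(y ~ group, data = d)

n <- nrow(d)
p <- 2                                    # indicator variables (3 levels - 1)
ss_total <- sum((d$y - mean(d$y))^2)      # total variation
ss_res   <- sum(residuals(fit)^2)         # within-group (residual) variation
F_hand   <- ((ss_total - ss_res) / p) / (ss_res / (n - p - 1))

F_lm <- unname(summary(fit)$fstatistic["value"])
c(by_hand = F_hand, from_lm = F_lm)       # the two agree
```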

Example: Petrels

Example: Petrels

Call:
lm(formula = R.Wing.Lth ~ Area, data = petrels)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.600  -5.486   0.400   5.404  26.400 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 389.6000     0.5432 717.238  < 2e-16 ***
Area2        -6.0036     0.8077  -7.433 3.09e-13 ***
Area3        -5.2250     2.2966  -2.275 0.023199 *  
Area4        -6.6893     1.3106  -5.104 4.29e-07 ***
Area5         4.1462     0.9528   4.351 1.55e-05 ***
Area6        -9.2667     2.6332  -3.519 0.000461 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.926 on 701 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared:  0.1719,    Adjusted R-squared:  0.166 
F-statistic: 29.11 on 5 and 701 DF,  p-value: < 2.2e-16

Example: Petrels

  • The overall F-test’s P-value suggests average right wing length differs between areas.

  • The summary table suggests that Area 5 has the largest birds, with right wing lengths on average 4.15mm larger than those in Area 1.

  • Birds in Area 6 are smallest, having 9.27 mm smaller right wing lengths compared to those in Area 1.

  • We can get confidence intervals for the mean using the predict function

Example: Petrels

mod <- lm(R.Wing.Lth ~ Area, data=petrels)
new_data <- data.frame(Area=factor(1:6))
predict(mod, new_data, interval="confidence")
       fit      lwr      upr
1 389.6000 388.5335 390.6665
2 383.5964 382.4229 384.7699
3 384.3750 379.9940 388.7560
4 382.9107 380.5690 385.2525
5 393.7462 392.2092 395.2831
6 380.3333 375.2746 385.3921
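For intervals that cover individual observations rather than the mean, `predict()` also accepts `interval="prediction"`. A sketch on simulated data (the variable names are made up, not the petrels data):

```r
# Prediction intervals are wider than confidence intervals, since they
# also include the residual scatter of individual observations.
set.seed(2)
d <- data.frame(g = factor(rep(1:3, each = 30)))
d$y <- rnorm(90, mean = c(10, 12, 9)[as.numeric(d$g)])
fit <- lm(y ~ g, data = d)

new_data <- data.frame(g = factor(1:3))
ci <- predict(fit, new_data, interval = "confidence")
pi <- predict(fit, new_data, interval = "prediction")
cbind(ci_width = ci[, "upr"] - ci[, "lwr"],
      pi_width = pi[, "upr"] - pi[, "lwr"])   # pi_width is always larger
```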

Visualising the differences

library(visreg)
visreg(mod)

Multiple factors

What if we have more than one factor?

  • We include a block of indicator variables for each factor.

  • The first level, or baseline, of each factor contributes to the intercept.

  • Subsequent levels are treated as differences compared to the baseline level.

  • The overall F-test and P-value for the model test whether any of the variables in the model are important.

  • We need another way to compute P-values for each factor separately.

Summary output for two factors

Call:
lm(formula = R.Wing.Lth ~ Sex + Area, data = petrels)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.950  -5.192   0.050   5.808  25.808 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 388.1492     0.8451 459.295  < 2e-16 ***
SexMale       2.0432     0.8369   2.441 0.014884 *  
Area2        -6.2020     0.7969  -7.783 2.57e-14 ***
Area3        -6.9005     2.4102  -2.863 0.004322 ** 
Area4        -6.5154     1.2977  -5.021 6.55e-07 ***
Area5         3.7581     0.9447   3.978 7.67e-05 ***
Area6        -8.8374     2.6037  -3.394 0.000727 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.788 on 696 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.1866,    Adjusted R-squared:  0.1796 
F-statistic: 26.61 on 6 and 696 DF,  p-value: < 2.2e-16

The anova function

mod2 <- lm(R.Wing.Lth ~ Sex + Area, data=petrels)
anova(mod2)
Analysis of Variance Table

Response: R.Wing.Lth
           Df Sum Sq Mean Sq F value    Pr(>F)    
Sex         1   1107 1106.58  14.328 0.0001668 ***
Area        5  11224 2244.84  29.066 < 2.2e-16 ***
Residuals 696  53754   77.23                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Each row in the anova table corresponds to a factor (or numeric covariate).

  • P-values for each variable can thus be read off directly.

Process of assessing a factor covariate

  • Fit linear model.

  • Do anova() to check the factor is important (small P).

  • If it is important, look at the summary() to see how the levels differ (or visreg() to visualise them).

  • The latter involves many comparisons (a multiple testing problem), which is why we do the overall test first.

  • Order can be important in the anova table.

  • Each row’s P-value adjusts for the rows above it.
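If the order-dependence is a concern, `drop1()` tests each term as if it were fitted last, adjusted for everything else in the model. A sketch on simulated data (not the petrels data; the variable names are made up):

```r
# drop1() gives one F-test per term, each adjusted for all other terms,
# so the result does not depend on the order in the formula.
set.seed(3)
d <- data.frame(
  sex  = factor(rep(c("F", "M"), 50)),
  area = factor(sample(1:3, 100, replace = TRUE))
)
d$y <- 10 + 2 * (d$sex == "M") + as.numeric(d$area) + rnorm(100)

fit <- lm(y ~ sex + area, data = d)
drop1(fit, test = "F")
```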

Example: Order matters

anova(lm(R.Wing.Lth ~ Sex + Area, data=petrels))
Analysis of Variance Table

Response: R.Wing.Lth
           Df Sum Sq Mean Sq F value    Pr(>F)    
Sex         1   1107 1106.58  14.328 0.0001668 ***
Area        5  11224 2244.84  29.066 < 2.2e-16 ***
Residuals 696  53754   77.23                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example: Order matters

anova(lm(R.Wing.Lth ~ Area + Sex, data=petrels))
Analysis of Variance Table

Response: R.Wing.Lth
           Df Sum Sq Mean Sq F value  Pr(>F)    
Area        5  11870 2374.10 30.7393 < 2e-16 ***
Sex         1    460  460.29  5.9597 0.01488 *  
Residuals 696  53754   77.23                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA table

  • Each row represents a single variable (factor or numeric).

  • Each row adjusted for what is above it (but not below).

  • Order can be important; if changing it alters the conclusions, that itself tells you something (the variables are confounded with each other).

  • If a factor is important, look at summary table to see which groups differ (or visualise model).

Petrels: Summary table

Call:
lm(formula = R.Wing.Lth ~ Sex + Area, data = petrels)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.950  -5.192   0.050   5.808  25.808 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 388.1492     0.8451 459.295  < 2e-16 ***
SexMale       2.0432     0.8369   2.441 0.014884 *  
Area2        -6.2020     0.7969  -7.783 2.57e-14 ***
Area3        -6.9005     2.4102  -2.863 0.004322 ** 
Area4        -6.5154     1.2977  -5.021 6.55e-07 ***
Area5         3.7581     0.9447   3.978 7.67e-05 ***
Area6        -8.8374     2.6037  -3.394 0.000727 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.788 on 696 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.1866,    Adjusted R-squared:  0.1796 
F-statistic: 26.61 on 6 and 696 DF,  p-value: < 2.2e-16

Petrels: Conclusion

  • From ANOVA table, both Area and Sex are important to the right wing length.

  • From summary table, we see that males have 2.04mm larger right wing length than females after accounting for differences between areas.

  • This doesn’t mean that male birds overall have right wing lengths 2.04mm larger than females.

  • In fact

    tapply(petrels$R.Wing.Lth, petrels$Sex, mean, na.rm=TRUE)
      Female     Male 
      385.2222 388.3309 
      

    they’re 3.11mm larger across all areas.

Petrels: Conclusion

  • Similarly, birds in area 5 have 3.76mm larger right wing length than those in area 1 after accounting for differences in sex.

  • Again,

    tapply(petrels$R.Wing.Lth, petrels$Area, mean, na.rm=TRUE)
           1        2        3        4        5        6 
      389.6000 383.5964 384.3750 382.9107 393.7462 380.3333 
      

    they’re 4.15mm larger across both sexes.
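The gap between the adjusted and raw differences arises because the design is unbalanced: the sex ratio differs between areas. A minimal simulated sketch (the numbers are made up, not the petrels data):

```r
# With unbalanced groups, raw group means mix the two effects together;
# the model coefficient adjusts for the other factor.
set.seed(4)
area <- factor(rep(1:2, c(80, 20)))
sex  <- factor(c(rep(c("F", "M"), c(20, 60)),   # mostly males in area 1
                 rep(c("F", "M"), c(15, 5))))   # mostly females in area 2
y <- 10 + 2 * (sex == "M") - 5 * (area == "2") + rnorm(100)

coef(lm(y ~ sex + area))["sexM"]   # adjusted difference, close to the true 2
diff(tapply(y, sex, mean))         # raw difference, pulled away from 2
```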

Summary

  • Always look at the anova table first, particularly if you have factors.

  • Look at the summary table next, and interpret the estimated coefficients there for the significant variables from the anova table.

  • Use visreg to visualise the effects: it makes effect sizes and uncertainty easier to see.

  • Remember that ‘not significant’ does not mean ‘not important’.

Interactions

Interactions

  • Our current models have been assuming that each effect is additive.

  • e.g. the difference in the right wing length between the sexes is common across all areas, and similarly the area effects are common for males and females.

  • This is often not the case.

Example: Hypothetical drugs

  • Suppose we have some medical condition and we have a numerical measure describing how good/bad a patient is.

  • Suppose further that we have two drugs A and B to treat that condition, and that we have the option of combining them.

  • Then there are 4 possible treatments: No drugs, just drug A, just drug B, or both drugs.

  • Our current linear models would assume that we can add the effects of drug A and B together for the “both drugs” case.

Example: Hypothetical drugs

The model equation would be \[ \mathsf{mean}(y) = \alpha + \beta_A z_A + \beta_B z_B \] where \(z_A\) and \(z_B\) are indicators for whether drug A or B are included, giving

Treatment      Mean
-------------  ---------------------------------------------------
Neither drug   \(\mathsf{mean}(y)=\alpha\)
Just drug A    \(\mathsf{mean}(y)=\alpha + \beta_A\)
Just drug B    \(\mathsf{mean}(y)=\alpha + \beta_B\)
Both drugs     \(\mathsf{mean}(y)=\alpha + \beta_A + \beta_B\)

Example: Hypothetical drugs

  • For the combined drugs treatment, we have \(\mathsf{mean}(y)=\alpha + \beta_A + \beta_B\)

  • We’re thus assuming that the individual benefits of the drugs add together when they are combined.

  • But it might be that the combined effect of the drugs is smaller or larger than the sum of the individual effects.

  • We need a bit more flexibility in the model.

Example: Hypothetical drugs

Interactions: Hypothetical drugs

We add another indicator variable \(z_{AB}\) which is 1 when both drugs are in use. \[ \mathsf{mean}(y) = \alpha + \beta_A z_A + \beta_B z_B + \beta_{AB} z_{AB} \]

Mathematical aside: the new variable is the product of the existing ones, as \(z_{AB}=1\) only if both \(z_A=1\) and \(z_B=1\). \(z_{A} \times z_{B}\) has this property.

The means for the neither-drug and single-drug treatments are still the same, as \(z_{AB}=0\) in those cases. The mean in the both-drugs treatment is the only one that changes, to \[ \mathsf{mean}(y) = \alpha + \beta_A + \beta_B + \beta_{AB}. \] Thus, \(\beta_{AB}\) measures whether the effects of the two drugs are additive.
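In R the product construction is what `~ A * B` (or equivalently `A + B + A:B`) builds automatically. A quick check on hypothetical two-level factors:

```r
# The interaction column of the design matrix is exactly the product
# of the two indicator columns.
d <- expand.grid(A = factor(0:1), B = factor(0:1))
X <- model.matrix(~ A * B, data = d)
X                                        # columns: (Intercept), A1, B1, A1:B1
all(X[, "A1:B1"] == X[, "A1"] * X[, "B1"])
```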

Interactions: Hypothetical drugs

  • \(\beta_{AB}\) measures whether the effects of the two drugs are additive.

  • An alternative way to think about it is that the effect of drug B may differ depending on whether the patient is already taking drug A.

  • e.g. the effect for drug B may be to increase the diagnostic measurement by 5 units in the absence of other treatment, but if the patient is on drug A as well, then the effect of drug B may be less pronounced.

Example: Hypothetical drugs

Interactions in RStudio

mod3 = lm(R.Wing.Lth ~ Sex + Area + Sex:Area, data=petrels)
anova(mod3)
Analysis of Variance Table

Response: R.Wing.Lth
           Df Sum Sq Mean Sq F value    Pr(>F)    
Sex         1   1107 1106.58 14.3035  0.000169 ***
Area        5  11224 2244.84 29.0165 < 2.2e-16 ***
Sex:Area    5    296   59.15  0.7646  0.575486    
Residuals 691  53459   77.36                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the interaction here is between factors, it should be evaluated using the ANOVA table.

Interactions in RStudio

  • From the anova table we see the interaction term is not significant (P=0.575).

  • Thus, there is no evidence in the data to suggest that the comparative size of male and female petrels differ between areas.

Summary table of interactions

Call:
lm(formula = R.Wing.Lth ~ Sex + Area + Sex:Area, data = petrels)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.855  -5.264   0.145   5.722  26.081 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   389.1186     1.1451 339.811  < 2e-16 ***
SexMale         0.8000     1.2967   0.617  0.53746    
Area2          -8.3745     1.7636  -4.748 2.49e-06 ***
Area3         -13.6186     6.3240  -2.153  0.03163 *  
Area4          -7.1186     2.2350  -3.185  0.00151 ** 
Area5           3.6506     2.6949   1.355  0.17597    
Area6         -11.9520     3.7690  -3.171  0.00159 ** 
SexMale:Area2   2.7336     1.9775   1.382  0.16731    
SexMale:Area3   7.9500     6.8418   1.162  0.24565    
SexMale:Area4   0.6571     2.7524   0.239  0.81137    
SexMale:Area5   0.2855     2.8799   0.099  0.92107    
SexMale:Area6   5.5333     5.2411   1.056  0.29145    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.796 on 691 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.1911,    Adjusted R-squared:  0.1782 
F-statistic: 14.84 on 11 and 691 DF,  p-value: < 2.2e-16

Summary table of interactions

  • While the interaction term isn’t significant (so all the interaction coefficients could be zero in the population), let’s assess what each row represents.

  • The Intercept contains the first levels of each factor, so females in area 1 have on average 389.12mm right wing lengths.

  • The SexMale term is the difference to the baseline, so this describes how males in area 1 differ. They have right wing lengths 0.8mm larger.

  • The Area2 term is the difference to the baseline, so this describes how females in area 2 differ from those in area 1. Their right wing lengths are 8.37mm smaller.

  • The SexMale:Area2 term is the differential effect of males in area 2 over and above the SexMale and Area2 effects. Compared with females in area 1, males in area 2 differ by \(0.80 - 8.37 + 2.73 = -4.84\)mm, i.e. their right wing lengths are 4.84mm smaller.

Visualising models using visreg

  • When you have interactions, visualise the effect of one variable within the levels of another.

  • Allows you to see how effects interact.

  • A better way to present results from the model than the summary table.

library(visreg)
visreg(mod3, "Area", by="Sex")
visreg(mod3, "Area", by="Sex", overlay=TRUE)

visreg: Model without interaction

visreg: Model with interaction

visreg: without interaction (overlay=TRUE)

visreg: with interaction (overlay=TRUE)

Other types of interaction

Consider \[ \mathsf{mean}(y) = \alpha + \beta_1 x + \beta_2 z \] where \(x\) is a numeric variable and \(z\) is an indicator variable for the 2nd level of a factor variable.

Then if we’re in the first level of the factor variable, \(z=0\) so the equation is \[ \mathsf{mean}(y) = \alpha + \beta_1 x \] If we’re in the second level, \(z=1\) so that \[ \mathsf{mean}(y) = \alpha + \beta_1 x + \beta_2 = (\alpha + \beta_2) + \beta_1 x. \]

The model fit is thus two parallel lines, separated by \(\beta_2\).
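A simulated sketch of this parallel-lines fit (the variable names and effect sizes are made up for illustration):

```r
# One shared slope for x, plus a vertical offset for the second level of g.
set.seed(5)
x <- runif(100, 0, 10)
g <- factor(rep(c("a", "b"), 50))
y <- 2 + 1.5 * x + 3 * (g == "b") + rnorm(100)

fit <- lm(y ~ x + g)
coef(fit)   # (Intercept) ~ alpha, x ~ beta1 (common slope), gb ~ beta2 (offset)
```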

Example: Calf weights

Example: Calf weights

c1 = lm(Weight ~ BirthWeight + Breed + Treatment + Age, data=calf)
anova(c1)
Analysis of Variance Table

Response: Weight
             Df Sum Sq Mean Sq  F value    Pr(>F)    
BirthWeight   1  10952   10952  390.912 < 2.2e-16 ***
Breed         2    888     444   15.853 2.037e-07 ***
Treatment     1    657     657   23.438 1.685e-06 ***
Age           1 212268  212268 7576.597 < 2.2e-16 ***
Residuals   542  15185      28                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example: Calf weights

visreg(c1, "Age", by="Treatment", overlay=TRUE, gg=TRUE)

Example: Calf weights

visreg(c1, "Age", by="Breed", overlay=TRUE, gg=TRUE)

Interaction of numeric variables and factors

We could add another term using the product of the numeric variable \(x\) and the indicator for the second level of the factor \(z\) \[ \mathsf{mean}(y) = \alpha + \beta_1 x + \beta_2 z + \beta_3 x z. \]

Then if we’re in the first level of the factor variable, \(z=0\) so the equation is \[ \mathsf{mean}(y) = \alpha + \beta_1 x \] whereas if we’re in the second level, \(z=1\) so that \[ \mathsf{mean}(y) = \alpha + \beta_1 x + \beta_2 + \beta_3 x = (\alpha + \beta_2) + (\beta_1 + \beta_3) x. \] The model fit is thus two separate lines.
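A matching sketch with the interaction term included, so each level gets its own slope (again with made-up names and numbers):

```r
# Adding x:g lets the slope differ between levels: the slope for the
# second level is the x coefficient plus the interaction coefficient.
set.seed(6)
x <- runif(100, 0, 10)
g <- factor(rep(c("a", "b"), 50))
y <- 2 + 1.0 * x + (g == "b") * (3 + 0.5 * x) + rnorm(100)

fit <- lm(y ~ x + g + x:g)
unname(coef(fit)["x"])                        # slope for level "a" (~1.0)
unname(coef(fit)["x"] + coef(fit)["x:gb"])    # slope for level "b" (~1.5)
```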

Example: Calf weights

Example: Calf weights

c2 <- lm(Weight ~ BirthWeight + Breed + Treatment + Age +
           Age:Treatment + Age:Breed, data=calf)
anova(c2)
Analysis of Variance Table

Response: Weight
               Df Sum Sq Mean Sq  F value    Pr(>F)    
BirthWeight     1  10952   10952  420.039 < 2.2e-16 ***
Breed           2    888     444   17.034 6.707e-08 ***
Treatment       1    657     657   25.184 7.088e-07 ***
Age             1 212268  212268 8141.130 < 2.2e-16 ***
Treatment:Age   1    494     494   18.936 1.617e-05 ***
Breed:Age       2    637     319   12.224 6.427e-06 ***
Residuals     539  14054      26                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Calf weight summary

Call:
lm(formula = Weight ~ BirthWeight + Breed + Treatment + Age + 
    Age:Treatment + Age:Breed, data = calf)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.3475  -2.7649   0.2259   2.8988  18.2773 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -2.00532    2.37036  -0.846    0.398    
BirthWeight       0.89258    0.05668  15.749  < 2e-16 ***
BreedJE           0.70809    1.67030   0.424    0.672    
BreedXB           0.10417    1.12685   0.092    0.926    
TreatmentLow      0.73167    0.88960   0.822    0.411    
Age               0.85105    0.02004  42.468  < 2e-16 ***
TreatmentLow:Age -0.07329    0.01716  -4.271 2.30e-05 ***
BreedJE:Age      -0.13480    0.02816  -4.788 2.18e-06 ***
BreedXB:Age      -0.03091    0.02076  -1.489    0.137    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.106 on 539 degrees of freedom
Multiple R-squared:  0.9414,    Adjusted R-squared:  0.9406 
F-statistic:  1083 on 8 and 539 DF,  p-value: < 2.2e-16

Calf weights

visreg(c2, "Age", by="Breed", overlay=TRUE, gg=TRUE)

Calf weights

visreg(c2, "Age", by="Treatment", overlay=TRUE, gg=TRUE)

Diagnostics

Better model

c3 <- lm(log(Weight) ~ log(BirthWeight) + Breed + Treatment + Age + 
         Age:Treatment + Age:Breed, data=calf)
anova(c3)
Analysis of Variance Table

Response: log(Weight)
                  Df Sum Sq Mean Sq   F value    Pr(>F)    
log(BirthWeight)   1  3.613   3.613  599.9067 < 2.2e-16 ***
Breed              2  0.182   0.091   15.0937 4.187e-07 ***
Treatment          1  0.113   0.113   18.8347 1.703e-05 ***
Age                1 58.253  58.253 9672.8215 < 2.2e-16 ***
Treatment:Age      1  0.086   0.086   14.2626 0.0001767 ***
Breed:Age          2  0.070   0.035    5.8391 0.0030987 ** 
Residuals        539  3.246   0.006                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Better model: Diagnostics

Better model: Fit

visreg(c3, "Age", by="Breed", trans=exp, partial=TRUE, overlay=TRUE, ylab="Weight", gg=TRUE)

Better model: Fit

visreg(c3, "Age", by="Treatment", trans=exp, partial=TRUE, overlay=TRUE, ylab="Weight", gg=TRUE)

Another problem though!

  • We know here that we have multiple calves measured through time.

  • Residuals from the same calf likely more similar to each other than to those of other calves.

  • The residuals are unlikely to be independent.

  • In this case we can check by plotting the residuals versus the calf identifier.

Another problem?

library(ggplot2)
calf_pred <- broom::augment(c3, data=calf)
ggplot(calf_pred, aes(x=Calf, group=Calf, y=.resid)) + geom_boxplot() + ylab("Residual")

Solution: See lab 4!

library(nlme)
mixed = lme(log(Weight) ~ log(BirthWeight) + Breed + Treatment + Age + 
            Breed:Age + Treatment:Age, random=~1|Calf, data=calf)

Learning outcomes

  • Review

  • Multiple factors

  • Interactions